Feature Extraction from the MovieLens 100k Dataset

MovieLens Dataset (https://grouplens.org/datasets/movielens/100k/)

100'000 movie ratings (1 to 5 stars) by 943 users for 1'682 different movies.


In [1]:
lines = sc.textFile("../data/ml-100k/u.data")
print(lines.first())
# user_id  movie_id  rating  timestamp


196	242	3	881250949

Read the text file line by line. Split each line at '\t', parse the fields, and create Row objects holding the ratings. The timestamp column is not used.


In [2]:
from pyspark.sql import Row

# Split each line into its four tab-separated fields.
ratings_split = lines.map(lambda s: s.split("\t"))
# Keep user ID, movie ID and rating; the timestamp (col[3]) is dropped.
ratings = ratings_split.map(
    lambda col: Row(userid=int(col[0]), movieid=int(col[1]), rating=float(col[2]))
)
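
The parsing step can be tried without a Spark context. The following is a minimal sketch in plain Python, assuming the tab-separated layout shown above (user ID, movie ID, rating, timestamp). The `RatingRecord` type and the `parse_rating` helper are hypothetical names used only for illustration; they are not part of the notebook or of PySpark.

```python
from collections import namedtuple

# Hypothetical record type mirroring the Row fields used in the notebook.
RatingRecord = namedtuple("RatingRecord", ["userid", "movieid", "rating"])

def parse_rating(line):
    """Split one tab-separated line of u.data and convert the first three fields."""
    col = line.rstrip("\n").split("\t")
    # The fourth field (timestamp) is ignored, as in the Spark version.
    return RatingRecord(userid=int(col[0]), movieid=int(col[1]), rating=float(col[2]))

sample = "196\t242\t3\t881250949"  # first line of u.data, as printed above
record = parse_rating(sample)
print(record)
```

The same function could be passed to `lines.map(...)` in place of the two lambdas, at the cost of returning a namedtuple instead of a Row.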

Load movie titles and movie IDs. Split this text file at '|'. Column 0 is the movie ID, column 1 is the title of the movie.


In [3]:
movies = sc.textFile("../data/ml-100k/u.item")
titles = movies.map(lambda s: s.split('|')).map(lambda line: (int(line[0]), line[1])).collectAsMap()

Show first ten ratings.


In [4]:
first10 = ratings.map(lambda r: '{:3d}'.format(r.userid) + ": "+str(r.rating) + "  "+titles[r.movieid]).take(10)
for s in first10: print(s)


196: 3.0  Kolya (1996)
186: 3.0  L.A. Confidential (1997)
 22: 1.0  Heavyweights (1994)
244: 2.0  Legends of the Fall (1994)
166: 1.0  Jackie Brown (1997)
298: 4.0  Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)
115: 2.0  Hunt for Red October, The (1990)
253: 5.0  Jungle Book, The (1994)
305: 3.0  Grease (1978)
  6: 3.0  Remains of the Day, The (1993)

Training of the ALS Model


In [5]:
from pyspark.ml.recommendation import ALS

Matrix factorization by ALS (Alternating Least Squares) using 10 iterations and rank 10, that is, 10 latent dimensions. Use 0.01 as the regularization parameter.


In [6]:
als = ALS(rank=10, maxIter=10, regParam=0.01, userCol="userid", 
          itemCol="movieid", ratingCol="rating")
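
To make the idea behind ALS concrete, here is a minimal NumPy sketch on a tiny, made-up rating matrix. It alternates between solving a regularized least-squares problem for every user (item factors fixed) and for every item (user factors fixed). This is an illustration under simplifying assumptions, not Spark's implementation: the function name `als_sketch`, the toy matrix, and the choice of rank 2 are all hypothetical, and everything runs on a single machine.

```python
import numpy as np

def als_sketch(R, mask, rank=2, n_iter=10, reg=0.01, seed=0):
    """Toy alternating least squares on a small dense matrix.

    R    : (n_users, n_items) array of ratings (values where mask is False are ignored)
    mask : boolean array of the same shape, True where a rating was observed
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    reg_eye = reg * np.eye(rank)
    for _ in range(n_iter):
        # Fix V and solve one regularized least-squares problem per user.
        for u in range(n_users):
            Vu = V[mask[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + reg_eye, Vu.T @ R[u, mask[u]])
        # Fix U and solve one regularized least-squares problem per item.
        for i in range(n_items):
            Ui = U[mask[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + reg_eye, Ui.T @ R[mask[:, i], i])
    return U, V

# Made-up ratings for 4 users and 4 items; 0 marks a missing rating.
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])
mask = R > 0
U, V = als_sketch(R, mask)
rmse = np.sqrt(np.mean((U @ V.T - R)[mask] ** 2))
print("RMSE on observed entries:", rmse)
```

The parameters `rank`, `n_iter` and `reg` play the same roles as `rank`, `maxIter` and `regParam` in the cell above, although the exact form of the regularization term may differ from Spark's.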

Train model (fit).


In [7]:
ratingsDF = spark.createDataFrame(ratings)
model = als.fit(ratingsDF)

User factor matrix: Latent features of user 10.


In [8]:
model.userFactors.first()


Out[8]:
Row(id=10, features=[-0.8963878750801086, -1.1145066022872925, -0.3316730260848999, -0.5937343239784241, -0.29513218998908997, 1.0258771181106567, -0.5694271326065063, 0.6119299530982971, -0.2400534451007843, 0.031482260674238205])

Item (movie) factor matrix: Latent features of the movie with ID 10.


In [9]:
model.itemFactors.first()


Out[9]:
Row(id=10, features=[-0.5647231936454773, -1.4298157691955566, 0.30189287662506104, -0.8106299638748169, -0.23782381415367126, -0.2895015776157379, -1.5579346418380737, 0.9421285390853882, 0.005889199208468199, -0.5461520552635193])
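
In a matrix factorization model of this kind, the predicted rating of a user for a movie is the dot product of the two latent vectors. The following sketch applies this to the two vectors printed above (user 10 and movie 10). The values are copied from the outputs and rounded to four decimals, so the result only approximates what `model.transform` would return for this pair.

```python
# Latent vectors of user 10 and movie 10, copied from the outputs above
# and rounded to four decimals.
user_10 = [-0.8964, -1.1145, -0.3317, -0.5937, -0.2951,
           1.0259, -0.5694, 0.6119, -0.2401, 0.0315]
movie_10 = [-0.5647, -1.4298, 0.3019, -0.8106, -0.2378,
            -0.2895, -1.5579, 0.9421, 0.0059, -0.5462]

# Predicted rating = dot product of the two rank-10 vectors.
predicted = sum(u * m for u, m in zip(user_10, movie_10))
print("{:5.3f}".format(predicted))
```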

User-specific Movie Recommendations

Ratings provided by user 51.


In [10]:
user = 51
user_ratings = ratings.filter(lambda r: r["userid"]==user).map(
    lambda r: str(r["rating"]) + ": " + titles[r["movieid"]]).collect()
for ur in user_ratings: print(ur)


3.0: GoodFellas (1990)
3.0: Ghost and the Darkness, The (1996)
4.0: Wizard of Oz, The (1939)
4.0: Shawshank Redemption, The (1994)
3.0: Conan the Barbarian (1981)
4.0: It's a Wonderful Life (1946)
1.0: My Fair Lady (1964)
5.0: Much Ado About Nothing (1993)
3.0: Rear Window (1954)
4.0: Unforgiven (1992)
3.0: Vertigo (1958)
5.0: Princess Bride, The (1987)
4.0: Indiana Jones and the Last Crusade (1989)
3.0: Army of Darkness (1993)
3.0: Stand by Me (1986)
4.0: Mr. Smith Goes to Washington (1939)
5.0: Die Hard (1988)
5.0: Empire Strikes Back, The (1980)
2.0: Citizen Kane (1941)
5.0: Star Wars (1977)
5.0: Return of the Jedi (1983)
3.0: American President, The (1995)
1.0: Singin' in the Rain (1952)

Predicted ratings of this user for four arbitrarily chosen movies.


In [11]:
movie_ids = [56, 176, 161, 179]  # separate name, so the `movies` RDD from In [3] is not overwritten
user_movies = spark.createDataFrame([(user, m) for m in movie_ids], 
                                    ["userid", "movieid"])
predictions = model.transform(user_movies).rdd.map(
    lambda r: "User " +str(r["userid"]) + ": predicted rating " + 
        "{:5.3f}".format(r["prediction"]) + 
        " for " + titles[r["movieid"]]).collect()
for p in predictions: print(p)


User 51: predicted rating 4.204 for Clockwork Orange, A (1971)
User 51: predicted rating 5.535 for Top Gun (1986)
User 51: predicted rating 4.113 for Aliens (1986)
User 51: predicted rating 3.599 for Pulp Fiction (1994)
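
The prediction for Top Gun (5.535) lies above the 5-star maximum, because the raw predictions of the model are not restricted to the rating scale. One possible post-processing step, which is not part of this notebook, is to clip predictions to the interval from 1 to 5. A minimal sketch, using the four values printed above:

```python
# Raw predictions copied from the output above.
raw = [4.204, 5.535, 4.113, 3.599]

# Clip each prediction to the valid rating range [1, 5].
clipped = [min(5.0, max(1.0, p)) for p in raw]
print(clipped)
```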

Comparison of similar movies using latent factors


In [12]:
movie = 50
print(titles[movie])


Star Wars (1977)

In [13]:
import numpy as np

movie_features = model.itemFactors.rdd
# Latent feature vector of the query movie as a NumPy array.
query_movie = np.array(movie_features.filter(lambda r: r.id == movie).map(lambda r: r.features).first())
print(query_movie)


[-0.22268045 -2.01642227 -0.41178048 -0.42766911  0.06418803  0.75309497
 -1.22033787  0.57380784 -0.51054227  0.63792366]

In [14]:
def cosine_similarity(a, b):
    return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
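
Cosine similarity compares only the direction of two vectors, not their length. The following sketch repeats the definition and evaluates it on a few made-up vectors: vectors pointing in the same direction give a value close to 1, opposite vectors a value close to -1, and orthogonal vectors a value close to 0.

```python
import numpy as np

def cosine_similarity(a, b):
    # Same definition as in the notebook cell.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
same = cosine_similarity(a, 2 * a)   # same direction, different length
opposite = cosine_similarity(a, -a)  # opposite direction
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same, opposite, orthogonal)
```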

The 10 movies most similar to "Star Wars". The query movie itself ranks first with similarity 1.0.


In [15]:
top10 = movie_features.map(lambda r: (r.id, cosine_similarity(query_movie, r.features))
                  ).sortBy(lambda r: -r[1]).map(lambda r: (titles[r[0]], r[1])).take(10)
for t, r in top10: print("{:.3}".format(r)+": "+t)


1.0: Star Wars (1977)
0.996: Empire Strikes Back, The (1980)
0.985: Amityville 1992: It's About Time (1992)
0.985: Amityville: A New Generation (1993)
0.982: Return of the Jedi (1983)
0.978: Raiders of the Lost Ark (1981)
0.955: Destiny Turns on the Radio (1995)
0.954: Groundhog Day (1993)
0.95: Star Trek: The Wrath of Khan (1982)
0.949: Fish Called Wanda, A (1988)
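
For a catalogue of this size, the same ranking could also be computed locally with NumPy once the item factors have been collected. The following is a sketch under that assumption. The helper `top_k_similar` is a hypothetical name, and the 2-dimensional factors are made up for illustration; they are not taken from the trained model.

```python
import numpy as np

def top_k_similar(query, ids, factors, k=3):
    """Return the k (id, similarity) pairs with the highest cosine similarity."""
    factors = np.asarray(factors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity of the query with every row of `factors` at once.
    sims = factors @ query / (np.linalg.norm(factors, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(ids[i], float(sims[i])) for i in order]

# Made-up item IDs and 2-dimensional item factors.
ids = [1, 2, 3, 4]
factors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
result = top_k_similar(factors[0], ids, factors, k=3)
for item_id, sim in result:
    print("{:.3}".format(sim) + ": item " + str(item_id))
```

As in the Spark version, the query item appears first in its own result list.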
